A Report of Recent Progress in Transformation-Based Error-Driven Learning
Abstract
Most recent research in trainable part of speech taggers has explored stochastic tagging. While these taggers obtain high accuracy, linguistic information is captured indirectly, typically in tens of thousands of lexical and contextual probabilities. In [Brill 92], a trainable rule-based tagger was described that obtained performance comparable to that of stochastic taggers, but captured relevant linguistic information in a small number of simple non-stochastic rules. In this paper, we describe a number of extensions to this rule-based tagger. First, we describe a method for expressing lexical relations in tagging that stochastic taggers are currently unable to express. Next, we show a rule-based approach to tagging unknown words. Finally, we show how the tagger can be extended into a k-best tagger, where multiple tags can be assigned to words in some cases of uncertainty.
Spoken Language Systems Group, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
1. INTRODUCTION
When automated part of speech tagging was initially explored [Klein and Simmons 63, Harris 62], people manually engineered rules for tagging, sometimes with the aid of a corpus. As large corpora became available, it became clear that simple Markov-model based stochastic taggers that were automatically trained could achieve high rates of tagging accuracy [Jelinek 85]. These stochastic taggers have a number of advantages over the manually built taggers, including obviating the need for laborious manual rule construction, and possibly capturing useful information that may not have been noticed by the human engineer. However, stochastic taggers have the disadvantage that linguistic information is only captured indirectly, in large tables of statistics.
Almost all recent work in developing automatically trained part of speech taggers has been on further exploring Markov-model based tagging [Jelinek 85, Church 88, DeRose 88, DeMarcken 90, Merialdo 91, Cutting et al. 92, Kupiec 92, Charniak et al. 93, Weischedel et al. 93]. [note 1] In [Brill 92], a trainable rule-based tagger is described that achieves performance comparable to that of stochastic taggers. Training this tagger is fully automated, but unlike trainable stochastic taggers, linguistic information is encoded directly in a set of simple non-stochastic rules.
In this paper, we describe some extensions to this rule-based tagger. These include a rule-based approach to: lexicalizing the tagger, tagging unknown words, and assigning the k-best tags to a word. All of these extensions, as well as the original tagger, are based upon a learning paradigm called transformation-based error-driven learning. This learning paradigm has shown promise in a number of other areas of natural language processing, and we hope that the extensions to transformation-based learning described in this paper can carry over to other domains of application as well. [note 2]
2. TRANSFORMATION-BASED ERROR-DRIVEN LEARNING
Transformation-based error-driven learning has been applied to a number of natural language problems, including part of speech tagging, prepositional phrase attachment disambiguation, and syntactic parsing [Brill 92, Brill 93, Brill 93a]. A similar approach is being explored for machine translation [Su et al. 92]. Figure 1 illustrates the learning process. First, unannotated text is passed through the initial-state annotator. The initial-state annotator can range in complexity from assigning random structure to assigning the output of a sophisticated manually created annotator. Once text has been passed through the initial-state annotator, it is then compared to the truth, [note 3] and transformations are learned that can be applied to the output of the initial-state annotator to make it better resemble the truth.
In all of the applications described in this paper, the following greedy search is applied: at each iteration of learning, the transformation is found whose application results in the highest score; that transformation is then added to the ordered transformation list and the training corpus is updated by applying the learned transformation. To define a specific application of transformation...
* This research was supported by ARPA under contract N00014-89-J-1332, monitored through the Office of Naval Research.
Note 1: Markov-model based taggers assign a sentence the tag sequence that maximizes Prob(word|tag) * Prob(tag|previous n tags).
Note 2: The programs described in this paper are freely available.
Note 3: As specified in a manually annotated corpus.
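The greedy search described in Section 2 can be illustrated with a small sketch. This is not the paper's implementation: the single rule template ("change tag A to tag B when the previous tag is Z"), the scoring function (net reduction in tagging errors), the toy tag sequences, and all names (`Rule`, `apply_rule`, `learn`) are assumptions made for illustration. Sentence boundaries are ignored to keep the example short.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical rule template: change from_tag to to_tag after prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str


def apply_rule(rule: Rule, tags: List[str]) -> List[str]:
    """Return a new tag list with the rule applied at every matching position.

    Contexts are read from the input list, so a change made at one position
    does not trigger the same rule at the next position.
    """
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
            out[i] = rule.to_tag
    return out


def errors(tags: List[str], truth: List[str]) -> int:
    """Count positions where the current tags disagree with the truth."""
    return sum(t != g for t, g in zip(tags, truth))


def learn(initial_tags: List[str], truth: List[str], max_rules: int = 10) -> List[Rule]:
    """Greedy search: repeatedly pick the rule with the highest score.

    The score used here (an assumption) is the net number of errors removed.
    """
    tags = list(initial_tags)
    learned: List[Rule] = []
    for _ in range(max_rules):
        # Propose one candidate rule per remaining error.
        candidates = {
            Rule(tags[i], truth[i], tags[i - 1])
            for i in range(1, len(tags))
            if tags[i] != truth[i]
        }
        best, best_gain = None, 0
        for rule in sorted(candidates):  # sorted for deterministic tie-breaking
            gain = errors(tags, truth) - errors(apply_rule(rule, tags), truth)
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:  # no rule improves the corpus; stop
            break
        learned.append(best)            # add to the ordered transformation list
        tags = apply_rule(best, tags)   # update the training corpus
    return learned


# Toy data: the initial-state annotator always tags "can" as MD.
initial = ["DT", "MD", "VBD", "TO", "MD", "DT", "MD"]
truth = ["DT", "NN", "VBD", "TO", "VB", "DT", "NN"]

if __name__ == "__main__":
    for rule in learn(initial, truth):
        print(rule)
```

Because the learned list is ordered, tagging new text under this sketch means running the initial-state annotator and then applying each rule in the order it was learned.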
Similar resources
Robust Control of Electrically Driven Robots in the Task Space
In this paper, a task-space controller for electrically driven robot manipulators is developed using a robust control algorithm. The controller is designed using voltage control strategy. Based on the nominal model of the robotic arm, the desired signals for motor currents are calculated and then the voltage control law is proposed based on the current errors and motor nominal electrical model....
Concordance-Based Data-Driven Learning Activities and Learning English Phrasal Verbs in EFL Classrooms
In spite of the highly beneficial applications of corpus linguistics in language pedagogy, it has not found its way into mainstream EFL. The major reasons seem to be the teachers’ lack of training and the unavailability of resources, especially computers in language classes. Phrasal verbs have been shown to be a problematic area of learning English as a foreign language due to their semantic op...
Transformation-based Bracketing: Fast Algorithms and Experimental Results
Abstract : In this paper we present an empirical study of transformation-based error-driven learning applied to the bracketing of sentences. We introduce a series of fast algorithms we have developed to learn bracketing rules; these algorithms allow rapid learning of rule sequences and are a significant improvement over previous transformation-based learning algorithms. We describe systematic e...
Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining
This paper presents a data mining application in metabolomics. It aims at building an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying its involved biomarkers. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profile. This dataset suffers from the problem of imbalanced classes...
Publication date: 1994